feat(page-cluster): add frequency-based template/content token split#901
Merged
Add computeDocumentFrequency() and splitTokensByFrequency(), a preprocessing layer that separates a page's shared site chrome (header/nav/footer) from its page-specific content by document frequency, before either half is compared with jaccardSimilarity().

A single flat Jaccard over a page's full token set has two failure modes: common chrome dilutes genuine content differences at loose similarity thresholds, and page-specific content variation (e.g. a freeform CMS block-editor page, where the exact block mix differs per page) swamps a real layout match. Splitting first and comparing each axis separately fixes both.

Validated against two real crawls: a small single-layout corporate site (a few hundred pages) showed a clean bimodal frequency split that stayed stable across a wide threshold range; a much larger site that turned out to be a federation of independent sub-sections (no single section covering even half the pages) showed that the split requires a homogeneous input, and that it recovers cleanly once scoped to one section.

code-review (xhigh) surfaced 9 findings, all in the frequency-cutoff comparison: an unvalidated threshold allowing degenerate cutoffs (0, NaN, or a percentage instead of a fraction), floating-point rounding at the documented inclusive boundary, and pageCount being passable out of sync with the documentFrequency it was computed from. Fixed by validating threshold eagerly, applying an epsilon tolerance to the boundary comparison, and bundling pageCount with documentFrequency into one DocumentFrequency result so they cannot be passed independently.
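A minimal TypeScript sketch of the idea described above, for illustration only. The function names come from this PR, but the signatures, the field names of DocumentFrequency, the EPSILON value, and the jaccard() helper (a stand-in for jaccardSimilarity() from PR #900) are all assumptions, not the merged implementation.

```typescript
// Assumed shape: pageCount travels with the counts so the two cannot drift apart.
interface DocumentFrequency {
  readonly pageCount: number;
  readonly counts: Map<string, number>;
}

interface TokenSplit {
  readonly template: Set<string>;
  readonly content: Set<string>;
}

// Assumed tolerance for the inclusive boundary; the PR does not state the real value.
const EPSILON = 1e-9;

function computeDocumentFrequency(pages: string[][]): DocumentFrequency {
  const counts = new Map<string, number>();
  pages.forEach((tokens) => {
    // Count each token once per page, however often it repeats on that page.
    new Set(tokens).forEach((token) => {
      counts.set(token, (counts.get(token) ?? 0) + 1);
    });
  });
  return { pageCount: pages.length, counts };
}

function splitTokensByFrequency(
  tokens: string[],
  df: DocumentFrequency,
  threshold: number,
): TokenSplit {
  // Eager validation: rejects 0, NaN, negatives, and percentages such as 80.
  if (!Number.isFinite(threshold) || threshold <= 0 || threshold > 1) {
    throw new RangeError(`threshold must be a fraction in (0, 1], got ${threshold}`);
  }
  const template = new Set<string>();
  const content = new Set<string>();
  new Set(tokens).forEach((token) => {
    const ratio = df.pageCount === 0 ? 0 : (df.counts.get(token) ?? 0) / df.pageCount;
    // Inclusive boundary, with a tolerance so rounding cannot push an
    // exactly-at-threshold token onto the content side.
    (ratio + EPSILON >= threshold ? template : content).add(token);
  });
  return { template, content };
}

// Hypothetical stand-in for jaccardSimilarity(); two empty sets count as identical.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((x) => {
    if (b.has(x)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Four toy pages sharing the same chrome tokens.
const demoPages: string[][] = [
  ['header', 'nav', 'footer', 'hero', 'pricing'],
  ['header', 'nav', 'footer', 'blog', 'author'],
  ['header', 'nav', 'footer', 'contact', 'form'],
  ['header', 'nav', 'footer', 'hero', 'team'],
];
const demoDf = computeDocumentFrequency(demoPages);
const pageA = splitTokensByFrequency(demoPages[0], demoDf, 0.75);
const pageB = splitTokensByFrequency(demoPages[3], demoDf, 0.75);

// Flat comparison: 4 shared tokens out of 6, so the chrome inflates the score.
const flatScore = jaccard(new Set(demoPages[0]), new Set(demoPages[3]));
// Template axis: identical chrome, similarity 1.
const templateScore = jaccard(pageA.template, pageB.template);
// Content axis: only 'hero' is shared out of 3 distinct tokens, similarity 1/3.
const contentScore = jaccard(pageA.content, pageB.content);
console.log({ flatScore, templateScore, contentScore });
```

In this toy input the flat score (about 0.67) hides the fact that the two pages share a layout exactly while agreeing on only a third of their content; the per-axis scores report the two facts separately. Bundling pageCount into the returned object is what makes the desync described in the review findings impossible to express at the call site.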
c9323a9 to 68acfe8
…uency-split # Conflicts: # cspell.json
This was referenced Jul 3, 2026
Summary
Add computeDocumentFrequency() and splitTokensByFrequency() to @d-zero/page-cluster: a preprocessing layer that separates a page's shared site chrome (header/nav/footer) from page-specific content by document frequency, before either half is compared with jaccardSimilarity() (PR #900, "feat(page-cluster): add Jaccard similarity and array edit distance primitives", still open).

Test plan
- yarn build (28 projects)
- yarn lint
- yarn test (1130 tests)
- /code-review xhigh: 9 findings (all in the frequency-cutoff comparison: unvalidated threshold, floating-point boundary rounding, pageCount/documentFrequency desync risk), all fixed
- /qa-engineer: no additional findings

Note
PR #900 (jaccardSimilarity/arrayEditDistance) is still open/unmerged; this branch has no compile-time dependency on it, but the two PRs are conceptually part of the same classifier-core preprocessing layer.

🤖 Generated with Claude Code